Abstract

Advanced Driver Assistance Systems (ADAS) have made
driving safer over the last decade. They prepare vehicles
for unsafe road conditions and alert drivers if they perform
a dangerous maneuver. However, many accidents are
unavoidable because by the time drivers are alerted, it is
already too late. Anticipating maneuvers beforehand can
alert drivers before they perform the maneuver and also
give ADAS more time to avoid or prepare for the danger.

In this work we anticipate driving maneuvers a few seconds
before they occur. For this purpose we equip a car
with cameras and a computing device to capture the driving
context from both inside and outside of the car. We propose
an Autoregressive Input-Output HMM to model the contextual
information along with the maneuvers. We evaluate our
approach on a diverse data set with 1180 miles of natural
freeway and city driving and show that we can anticipate
maneuvers 3.5 seconds before they occur with over 80%
F1-score in real-time.


1. Introduction 

Over the last decade cars have been equipped with various
assistive technologies in order to provide a safe driving
experience. Technologies such as lane keeping, blind
spot check, pre-crash systems, etc., are successful in alerting
drivers whenever they commit a dangerous maneuver.
Still, in the US alone more than 33,000 people die in road
accidents every year, the majority of which are due to
inappropriate maneuvers. We need mechanisms that can
alert drivers before they perform a dangerous maneuver in
order to avert many such accidents [28]. In this work we
address this problem of anticipating maneuvers that a driver
is likely to perform in the next few seconds (Figure 1).

Anticipating future human actions has recently been a
topic of interest to both the vision and robotics communities
[12, 13, 42]. Figure 1 shows our system anticipating a



Figure 1: Anticipating maneuvers. Our algorithm anticipates
driving maneuvers performed a few seconds in the future. It uses
information from multiple sources including videos, vehicle dynamics,
GPS, and street maps to anticipate the probability of different
future maneuvers.

left turn maneuver a few seconds before the car reaches the
intersection. Our system also outputs probabilities over the
maneuvers the driver can perform. With this prior knowledge
of maneuvers, the driver assistance systems can alert
drivers about possible dangers before they perform the maneuver,
thereby giving them more time to react. Some previous
works also predict a driver’s future maneuver.
However, as we show in the following sections, these
methods use limited context and do not accurately model
the anticipation problem.

In order to anticipate maneuvers, we reason with the contextual
information from the surrounding events, which we
refer to as the driving context. We obtain this driving context
from multiple sources. We use videos of the driver inside
the car and the road in front, the vehicle’s dynamics,
global position coordinates (GPS), and street maps; from
this we extract a time series of multi-modal data from both
inside and outside the vehicle. The challenge lies in modeling
the temporal aspects of driving and in detecting the
contextual cues that help in anticipating maneuvers.

Modeling maneuver anticipation also requires joint reasoning
about the driving context and the driver’s intention. The
challenge here is that the driver’s intentions are not directly
observable, and their interactions with the driving context are
complex. For example, the driver is influenced by external
events such as traffic conditions. The nature of these interactions
is generative and they require a specially tailored
modeling approach.

In this work we propose a model and a learning algorithm
to capture the temporal aspects of the problem, along
with the generative nature of the interactions. Our model
is an Autoregressive Input-Output Hidden Markov Model
(AIO-HMM) that jointly captures the context from both inside
and outside the vehicle. AIO-HMM models how events
from outside the vehicle affect the driver’s intention, which
then generates events inside the vehicle. We learn the AIO-HMM
model parameters from natural driving data and during
inference output the probability of each maneuver.

We evaluate our approach on a driving data set with 1180
miles of natural freeway and city driving collected across
two states from 10 drivers, with different kinds of driving
maneuvers. We demonstrate that our approach anticipates
maneuvers 3.5 seconds before they occur with 80%
precision and recall. We believe that our work creates scope
for new ADAS features to make roads safer. In summary,
our key contributions are as follows:

• We propose an approach for anticipating driving maneuvers
several seconds in advance.

• We model the driving context from inside and outside 
the car with an autoregressive input-output HMM. 

• We release the first data set of natural driving with 
videos from both inside and outside the car, GPS, and 
speed information. 

Our data set and code are available at: http://www.brain4cars.com

2. Related Work 

Assistive features for vehicles. The latest cars available on the
market come equipped with cameras and sensors to monitor
the surrounding environment. Through multi-sensory
fusion they provide assistive features like lane keeping, forward
collision avoidance, adaptive cruise control, etc. These
systems warn drivers when they perform a potentially dangerous
maneuver. Driver monitoring for distraction
and drowsiness has also been extensively researched [9, 27].
Techniques like eye-gaze tracking are now commercially
available (Seeing Machines Ltd.) and have been effective
in detecting distraction. Our work complements existing
ADAS and driver monitoring techniques by anticipating
maneuvers several seconds before they occur.

Closely related to our work are previous works on predicting the
driver’s intent. Vehicle trajectory has been used to predict
the intent for a lane change or turn maneuver.
Most of these works ignore the rich context available from
cameras, GPS, and street maps. Trivedi et al. [36] and Morris
et al. [24] predict lane change intent using the rich context
from cameras both inside and outside the vehicle. Both
works train a discriminative classifier which assumes that
informative contextual cues always appear at a fixed time
before the maneuver. We show that this assumption is not
true, and in fact the temporal aspect of the problem should
be carefully modeled. Our AIO-HMM takes a generative
approach and handles the temporal aspect of this problem.

Anticipation and Modeling Humans. Modeling of human
motion has given rise to many applications, anticipation being
one of them. Wang et al. [42], Koppula et al. [13, 15],
and Sener et al. [29] demonstrate better human-robot collaboration
by anticipating a human’s future movements. Kitani
et al. [12] model human navigation in order to anticipate the
path they will follow. Similar to these works, we anticipate
human actions, which are driving maneuvers in our case.
However, the algorithms proposed in the previous works do
not apply in our setting. In our case, anticipating maneuvers
requires modeling the interaction between the driving context
and the driver’s intention. Such interactions are absent
in the previous works. We propose AIO-HMM to model
these aspects of the problem.

Computer vision for analyzing the human face. The vision
approaches related to our work are face detection and
tracking, statistical models of the face, and pose estimation
methods for the face [44]. The Active Appearance Model
(AAM) [6] and its variants [22, 43] statistically model the
shape and texture of the face. AAMs have also been used to
estimate the 3D pose of a face from a single image [44] and
in the design of assistive features for driver monitoring [27, 32].
In our approach we adapt off-the-shelf face detection
and tracking algorithms for the robustness required for
anticipation (Section 5).

Learning temporal models. Temporal models are commonly
used to model human activities.
These models have been used in both discriminative and
generative fashions. The discriminative temporal models
are mostly inspired by the Conditional Random Field
(CRF), which captures the temporal structure of the
problem. Wang et al. [41] and Morency et al. [23] propose
dynamic extensions of the CRF for image segmentation
and gesture recognition, respectively. The generative
approaches for temporal modeling include various filtering
methods, such as Kalman and particle filters, Hidden
Markov Models, and many types of Dynamic Bayesian Networks
[25]. Some previous works [26] used HMMs
to model different aspects of the driver’s behaviour. Most
of these generative approaches model how latent (hidden)
states influence the observations. However, in our problem
both the latent states and the observations influence each
other. In particular, our AIO-HMM model is inspired by
the Input-Output HMM. In the following sections we
will explain the advantages of AIO-HMM over HMMs for
anticipating maneuvers and also compare its performance
with variants of HMM in the experiments (Section 6).
Figure 2: System Overview. Our system anticipating a left lane change maneuver. (a) We process multi-modal data including GPS, speed,
street maps, and events inside and outside of the vehicle using video cameras. (b) Vision pipeline extracts visual cues such as driver’s
head movements. (c) The inside and outside driving context is processed to extract expressive features. (d,e) Using our trained models we
anticipate the probability of each maneuver.

3. Problem Overview 

Our goal is to anticipate driving maneuvers a few seconds
before they occur. This includes anticipating a lane
change before the wheels touch the lane markings or anticipating
if the driver keeps straight or makes a turn when approaching
an intersection. This is a challenging problem for
multiple reasons. First, it requires the modeling of context
from different sources. Information from a single source,
such as a camera capturing events outside the car, is not sufficiently
rich. Additional visual information from within the
car can also be used. For example, the driver’s head movements
are useful for anticipation: drivers typically check
for the side traffic while changing lanes and scan the cross
traffic at intersections.

Second, reasoning about maneuvers should take into account
the driving context at both local and global levels.
Local context requires modeling events in the vehicle’s vicinity
such as the surrounding vision, GPS, and speed information.
On the other hand, factors that influence the overall
route contribute to the global context, such as the driver’s
final destination. Third, the informative cues necessary for
anticipation appear at variable times before the maneuver,
as illustrated in Figure 3. In particular, the time interval between
the driver’s head movement and the occurrence of
the maneuver depends on factors such as the speed, traffic
conditions, the GPS location, etc.

We obtain the driving context from different sources as
shown in Figure 2. Our system includes: (1) a driver-facing
camera inside the vehicle, (2) a road-facing camera outside
the vehicle, (3) a speed logger, and (4) a GPS and map logger.
The information from these sources constitutes the driving
context. We use the face camera to track the driver’s
head movements. The video from the road camera enables
additional reasoning on maneuvers. For example, when the
vehicle is in the left-most lane, the only safe maneuvers are
a right-lane change or keeping straight, unless the vehicle
is approaching an intersection. Maneuvers also correlate
with the vehicle’s speed, e.g., turns usually happen at lower
speeds than lane changes. Additionally, the GPS data augmented
with the street map enables us to detect upcoming
road artifacts such as intersections, highway exits, etc. We
now describe our model and the learning algorithm.



Figure 3: Variable time occurrence of events. Left: The events
inside the vehicle before the maneuvers. We track the driver’s face
along with many facial points. Right: The trajectories generated
by the horizontal motion of facial points (pixels) ‘t’ seconds before
the maneuver. X-axis is the time and Y-axis is the pixels’ horizontal
coordinates. Informative cues appear during the shaded time
interval. Such cues occur at variable times before the maneuver,
and the order in which the cues appear is also important.

4. Our Approach 

Driving maneuvers are influenced by multiple interactions
involving the vehicle, its driver, outside traffic, and occasionally
global factors like the driver’s destination. These
interactions influence the driver’s intention, i.e. their state
of mind before the maneuver, which is not directly observable.
We represent the driver’s intention with discrete states
that are latent (or hidden). In order to anticipate maneuvers,
we jointly model the driving context and the latent states in
a tractable manner. We represent the driving context as a
set of features, which we describe in Section 5. We now
present the motivation for our model and then describe the
model, along with the learning and inference algorithms.

4.1. Modeling driving maneuvers 

Modeling maneuvers requires temporal modeling of the
driving context (Figure 3). Discriminative methods, such as
the Support Vector Machine and the Relevance Vector Machine
[34], which do not model the temporal aspect, perform
poorly (shown in Section 6.2). Therefore, a temporal model
such as the Hidden Markov Model (HMM) is better suited.
Figure 4: AIO-HMM. The model has three layers: (i) Input (top):
this layer represents outside vehicle features X; (ii) Hidden (middle):
this layer represents driver’s latent states Y; and (iii) Output
(bottom): this layer represents inside vehicle features Z. This
layer also captures temporal dependencies of inside vehicle features.
T represents time.


An HMM models how the driver’s latent states generate
both the inside driving context and the outside driving context.
However, a more accurate model should capture how
events outside the vehicle (i.e. the outside driving context)
affect the driver’s state of mind, which then generates the
observations inside the vehicle (i.e. the inside driving context).
Such interactions can be modeled by an Input-Output
HMM (IOHMM). However, modeling the problem with
an IOHMM will not capture the temporal dependencies of the
inside driving context. These dependencies are critical to
capturing smooth and temporally correlated behaviours
such as the driver’s face movements. We therefore present the
Autoregressive Input-Output HMM (AIO-HMM), which extends
the IOHMM to model these observation dependencies.
Figure 4 shows the AIO-HMM graphical model.

4.2. Modeling Maneuvers with AIO-HMM 

Given T seconds long driving context C before the maneuver
M, we learn a generative model for the context
$P(C \mid M)$. The driving context C consists of the outside driving
context and the inside driving context. The outside and
inside contexts are temporal sequences represented by the
outside features $X_1^K = \{X_1, \ldots, X_K\}$ and the inside features
$Z_1^K = \{Z_1, \ldots, Z_K\}$ respectively. The corresponding
sequence of the driver’s latent states is $Y_1^K = \{Y_1, \ldots, Y_K\}$.
X and Z are vectors and Y is a discrete state.

$$
\begin{aligned}
P(C \mid M) &= \sum_{Y_1^K} P(Z_1^K, X_1^K, Y_1^K \mid M) \\
            &= P(X_1^K \mid M) \sum_{Y_1^K} P(Z_1^K, Y_1^K \mid X_1^K, M) \\
            &\propto \sum_{Y_1^K} P(Z_1^K, Y_1^K \mid X_1^K, M) \qquad (1)
\end{aligned}
$$

We model the correlations between X, Y and Z with an
AIO-HMM as shown in Figure 4. The AIO-HMM models
the distribution in equation (1). It does not assume any
generative process for the outside features $P(X_1^K \mid M)$. It
instead models them in a discriminative manner. The top
(input) layer of the AIO-HMM consists of outside features
$X_1^K$. The outside features then affect the driver’s latent
states $Y_1^K$, represented by the middle (hidden) layer, which
then generates the inside features $Z_1^K$ at the bottom (output)
layer. The events inside the vehicle such as the driver’s head
movements are temporally correlated because they are generally
smooth. The AIO-HMM handles these dependencies
with autoregressive connections in the output layer.

Model Parameters. AIO-HMM has two types of parameters:
(i) state transition parameters w; and (ii) observation
emission parameters $(\mu, \Sigma)$. We use set $\mathcal{S}$ to denote the possible
latent states of the driver. For each state $Y = i \in \mathcal{S}$,
we parametrize transition probabilities of leaving the state
with log-linear functions, and parametrize the output layer
feature emissions with normal distributions.

Transition: $P(Y_t = j \mid Y_{t-1} = i, X_t; w_{ij}) = \dfrac{e^{w_{ij} \cdot X_t}}{\sum_{l \in \mathcal{S}} e^{w_{il} \cdot X_t}}$

Emission: $P(Z_t \mid Y_t = i, X_t, Z_{t-1}; \mu_{it}, \Sigma_i) = \mathcal{N}(Z_t \mid \mu_{it}, \Sigma_i)$

The inside (vehicle) features represented by the output
layer are jointly influenced by all three layers. These interactions
are modeled by the mean and variance of the normal
distribution. We model the mean of the distribution using
the outside and inside features from the vehicle as follows:

$\mu_{it} = (1 + a_i \cdot X_t + b_i \cdot Z_{t-1}) \mu_i$

In the equation above, $a_i$ and $b_i$ are parameters that we
learn for every state $i \in \mathcal{S}$. Therefore, the parameters we
learn for state $i \in \mathcal{S}$ are $\theta_i = \{\mu_i, a_i, b_i, \Sigma_i, \text{ and } w_{ij} \mid j \in \mathcal{S}\}$,
and the overall model parameters are $\Theta = \{\theta_i \mid i \in \mathcal{S}\}$.
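
The parametrization above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names, the list-based vectors, and the diagonal covariance used in the log-density are assumptions made here for brevity.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def transition_probs(w_i, x_t):
    """P(Y_t = j | Y_{t-1} = i, X_t) for every j: a softmax over w_ij . X_t.

    w_i is a list of weight vectors, one per destination state j.
    """
    scores = [dot(w_ij, x_t) for w_ij in w_i]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def emission_mean(mu_i, a_i, b_i, x_t, z_prev):
    """mu_it = (1 + a_i . X_t + b_i . Z_{t-1}) * mu_i."""
    scale = 1.0 + dot(a_i, x_t) + dot(b_i, z_prev)
    return [scale * m for m in mu_i]

def log_emission(z_t, mu_it, var_i):
    """log N(Z_t | mu_it, Sigma_i), assuming (in this sketch only) a diagonal Sigma_i."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (z - m) ** 2 / v)
        for z, m, v in zip(z_t, mu_it, var_i)
    )
```

Note how the outside features enter twice: they reweight the transitions through the softmax and rescale the emission mean, while the previous inside feature vector supplies the autoregressive term.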

4.3. Learning AIO-HMM parameters 

The training data $\mathcal{D} = \{(X_{1,n}^{K_n}, Z_{1,n}^{K_n}) \mid n = 1, \ldots, N\}$ consists
of N instances of a maneuver M. The goal is to maximize
the data log-likelihood.

$$l(\Theta; \mathcal{D}) = \sum_{n=1}^{N} \log P(Z_{1,n}^{K_n} \mid X_{1,n}^{K_n}; \Theta) \qquad (2)$$

Directly optimizing equation (2) is challenging because the
variables Y representing the driver’s states are latent. We
therefore use the iterative EM procedure to learn the model
parameters. In EM, instead of directly maximizing equation (2),
we maximize its simpler lower bound. We estimate
the lower bound in the E-step and then maximize that estimate
in the M-step. These two steps are repeated iteratively.

E-step. In the E-step we get the lower bound of equation (2)
by calculating the expected value of the complete data log-likelihood
using the current estimate of the parameter $\hat{\Theta}$.

E-step: $Q(\Theta; \hat{\Theta}) = E[l_c(\Theta; \mathcal{D}_c) \mid \hat{\Theta}, \mathcal{D}] \qquad (3)$

where $l_c(\Theta; \mathcal{D}_c)$ is the log-likelihood of the complete data
$\mathcal{D}_c$ defined as:

$$\mathcal{D}_c = \{(X_{1,n}^{K_n}, Z_{1,n}^{K_n}, Y_{1,n}^{K_n}) \mid n = 1, \ldots, N\} \qquad (4)$$

$$l_c(\Theta; \mathcal{D}_c) = \sum_{n=1}^{N} \log P(Z_{1,n}^{K_n}, Y_{1,n}^{K_n} \mid X_{1,n}^{K_n}; \Theta) \qquad (5)$$

We should note that the occurrences of hidden variables
Y in $l_c(\Theta; \mathcal{D}_c)$ are marginalized in equation (3), and hence
Y need not be known. We efficiently estimate $Q(\Theta; \hat{\Theta})$
using the forward-backward algorithm [25].
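
The per-sequence term inside equation (2), the log-likelihood of the inside features given the outside features, can be computed with the forward half of the forward-backward recursion. The following is a minimal log-space sketch under stated assumptions: the callables `log_trans` and `log_emit`, the initial distribution `log_init`, and the convention of passing `None` as the previous inside feature at the first step are all hypothetical choices made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(xs, zs, n_states, log_init, log_trans, log_emit):
    """Return log P(Z_1..Z_K | X_1..X_K) by the forward recursion.

    log_init[i]                   : log P(Y_1 = i), assumed given
    log_trans(i, j, x_t)          : log P(Y_t = j | Y_{t-1} = i, X_t)
    log_emit(i, x_t, z_t, z_prev) : log P(Z_t | Y_t = i, X_t, Z_{t-1});
                                    z_prev is None at the first step (sketch convention)
    """
    # alpha[i] = log P(Z_1..Z_t, Y_t = i | X_1..X_t)
    alpha = [log_init[i] + log_emit(i, xs[0], zs[0], None) for i in range(n_states)]
    for t in range(1, len(zs)):
        alpha = [
            logsumexp([alpha[i] + log_trans(i, j, xs[t]) for i in range(n_states)])
            + log_emit(j, xs[t], zs[t], zs[t - 1])
            for j in range(n_states)
        ]
    return logsumexp(alpha)
```

Working in log-space avoids underflow on long sequences; the backward pass needed for the E-step posteriors follows the same pattern in reverse and is omitted here.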

M-step. In the M-step we maximize the expected value of
the complete data log-likelihood $Q(\Theta; \hat{\Theta})$ and update the
model parameter as follows:

M-step: $\hat{\Theta} = \arg\max_{\Theta} Q(\Theta; \hat{\Theta}) \qquad (6)$

Solving equation (6) requires us to optimize for the parameters
$\mu$, a, b, $\Sigma$, and w. We optimize all parameters except
w exactly by deriving their closed form update expressions.
We optimize w using gradient descent. Refer to
the supplementary material for detailed E and M steps.¹

4.4. Inference of Maneuvers 

Our learning algorithm trains separate AIO-HMM models
for each maneuver. The goal during inference is to determine
which model best explains the past T seconds of the
driving context not seen during training. We evaluate the
likelihood of the inside and outside feature sequences ($Z_1^K$
and $X_1^K$) for each maneuver, and anticipate the probability
$P_M$ of each maneuver M as follows:

$$P_M = P(M \mid Z_1^K, X_1^K) \propto P(Z_1^K, X_1^K \mid M) P(M) \qquad (7)$$

Algorithm 1 shows the complete inference procedure.
The inference in equation (7) simply requires a forward pass
of the AIO-HMM, the complexity of which is
$O(K(|\mathcal{S}|^2 + |\mathcal{S}||Z|^3 + |\mathcal{S}||X|))$. However, in practice it
is only $O(K|\mathcal{S}||Z|^3)$ because $|Z|^3 \gg |\mathcal{S}|$ and $|Z|^3 \gg |X|$.
Here $|\mathcal{S}|$ is the number of discrete states representing the
driver’s intention, while $|Z|$ and $|X|$ are the dimensions of
the inside and outside feature vectors respectively. In equation (7),
$P(M)$ is the prior probability of maneuver M. We
assume an uninformative uniform prior over the maneuvers.
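
Once each maneuver's model has produced a log-likelihood from its forward pass, equation (7) reduces to adding the log-prior and normalizing. A minimal sketch follows; the function name, the dictionary interface, and the maneuver labels are illustrative assumptions, not part of the original system.

```python
import math

def maneuver_posterior(log_likelihoods, log_prior=None):
    """Normalize per-maneuver scores into P(M | Z, X), following equation (7).

    log_likelihoods: dict mapping a maneuver name to the log-likelihood
                     returned by that maneuver's model (names are illustrative).
    log_prior      : optional dict of log P(M); uniform if omitted.
    """
    names = list(log_likelihoods)
    if log_prior is None:
        log_prior = {m: -math.log(len(names)) for m in names}
    scores = {m: log_likelihoods[m] + log_prior[m] for m in names}
    top = max(scores.values())           # shift before exponentiating
    unnorm = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}
```

With a uniform prior the prior term cancels in the normalization, so the posterior is simply a softmax over the per-model log-likelihoods.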


Algorithm 1 Anticipating maneuvers

input Driving videos, GPS, Maps and Vehicle Dynamics
output Probability of each maneuver

Initialize the face tracker with the driver’s face
while driving do
    Track the driver’s face [38]
    Extract features $Z_1^K$ and $X_1^K$ (Sec. 5)
    Inference $P_M = P(M \mid Z_1^K, X_1^K)$ (Eq. 7)
    Send the inferred probability of each maneuver to ADAS
end while


5. Features 

We extract features by processing the inside and outside 
driving contexts. We denote the inside features with Z and 
the outside features with X. 

5.1. Inside-vehicle features. 

The inside features Z capture the driver’s head movements.
Our vision pipeline consists of face detection, tracking,
and feature extraction modules. We extract head motion
features per-frame, denoted by $\phi(\text{face})$. For AIO-HMM,
we compute Z by aggregating $\phi(\text{face})$ for every 20
frames, i.e., $Z = \sum_{i=1}^{20} \phi(\text{face}_i) / \|\sum_{i=1}^{20} \phi(\text{face}_i)\|$.
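
The aggregation step can be illustrated with a short sketch. The text does not state which norm is used, so the L2 norm below is an assumption, as are the function name and the guard for an all-zero window.

```python
import math

def aggregate_window(frame_features):
    """Sum per-frame feature vectors over a window and normalize the sum.

    frame_features: list of equal-length vectors phi(face_i), e.g. 20 frames.
    The L2 norm and the zero-vector guard are assumptions of this sketch.
    """
    summed = [sum(col) for col in zip(*frame_features)]
    norm = math.sqrt(sum(v * v for v in summed))
    if norm == 0.0:
        return summed
    return [v / norm for v in summed]
```

Normalizing the summed vector makes Z insensitive to how many facial points happened to be tracked in a given window.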

1 http://www.brain4cars.com/ICCVsupp.pdf



Figure 5: Inside vehicle feature extraction. The angular histogram
features extracted at three different time steps for a left
turn maneuver. Bottom: Trajectories for the horizontal motion
of tracked facial pixels ‘t’ seconds before the maneuver. At t=5
seconds before the maneuver the driver is looking straight, at t=3
looks (left) in the direction of maneuver, and at t=2 looks (right) in
opposite direction for the crossing traffic. Middle: Average motion
vector of tracked facial pixels in polar coordinates. r is the average
movement of pixels and arrow indicates the direction in which
the face moves when looking from the camera. Top: Normalized
angular histogram features.

Face detection and tracking. We detect the driver’s face
using a trained Viola-Jones face detector [38]. From the detected
face, we first extract visually discriminative (facial)
points using the Shi-Tomasi corner detector [30] and then
track those facial points using the Kanade-Lucas-Tomasi
tracker. However, the tracking may accumulate
errors over time because of changes in illumination
due to the shadows of trees, traffic, etc. We therefore constrain
the tracked facial points to follow a projective transformation
and remove the incorrectly tracked points using
the RANSAC algorithm. While tracking the facial points,
we lose some of the tracked points with every new frame.
To address this problem, we re-initialize the tracker with
new discriminative facial points once the number of tracked
points falls below a threshold.

Figure 6: Our data set is diverse in drivers and landscape.

Head motion features. For maneuver anticipation the horizontal
movement of the face and its angular rotation (yaw)
are particularly important. From the face tracking we obtain
face tracks, which are 2D trajectories of the tracked facial
points in the image plane. Figure 5 (bottom) shows how the
horizontal coordinates of the tracked facial points vary with
time before a left turn maneuver. We represent the driver’s
face movements and rotations with histogram features. In
particular, we take matching facial points between successive
frames and create histograms of their corresponding
horizontal motions (in pixels) and angular motions in the
image plane (Figure 5). We bin the horizontal and angular
motions using $[< -2,\ -2 \text{ to } 0,\ 0 \text{ to } 2,\ > 2]$ and
$[0 \text{ to } \frac{\pi}{2},\ \frac{\pi}{2} \text{ to } \pi,\ \pi \text{ to } \frac{3\pi}{2},\ \frac{3\pi}{2} \text{ to } 2\pi]$, respectively. We
also calculate the mean movement of the driver’s face center.
This gives us $\phi(\text{face}) \in \mathbb{R}^9$ facial features per-frame.
The driver’s eye-gaze is also a useful feature. However,
robustly estimating 3D eye-gaze in outdoor environments is
still a topic of research, and orthogonal to this work on anticipation.
We therefore do not consider eye-gaze features.
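
The per-frame histogram construction can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the ordering of the nine entries, the handling of values that fall exactly on a bin edge, the use of `atan2` on raw image coordinates, and the choice of mean horizontal displacement as the face-center movement are all assumptions.

```python
import math

def head_motion_features(prev_pts, curr_pts):
    """Build a 9-dimensional per-frame feature from matched facial points.

    prev_pts, curr_pts: lists of (x, y) pixel coordinates of the same
    tracked points in two successive frames.
    Layout (assumed here): 4 horizontal-motion bins, 4 angular bins,
    then the mean horizontal displacement.
    """
    horiz = [0, 0, 0, 0]   # bins: < -2, -2 to 0, 0 to 2, > 2 (pixels)
    ang = [0, 0, 0, 0]     # bins: quarters of [0, 2*pi)
    total_dx = 0.0
    for (x0, y0), (x1, y1) in zip(prev_pts, curr_pts):
        dx, dy = x1 - x0, y1 - y0
        total_dx += dx
        # Edge values (-2, 0, 2) are assigned by convention of this sketch.
        if dx < -2:
            horiz[0] += 1
        elif dx < 0:
            horiz[1] += 1
        elif dx <= 2:
            horiz[2] += 1
        else:
            horiz[3] += 1
        theta = math.atan2(dy, dx) % (2.0 * math.pi)
        ang[min(int(theta // (math.pi / 2.0)), 3)] += 1
    n = max(len(prev_pts), 1)
    return horiz + ang + [total_dx / n]
```

These raw counts would then be summed over a window and normalized to form Z, as described in Section 5.1.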

5.2. Outside-vehicle features. 

The outside feature vector X encodes the information
about the outside environment such as the road conditions,
vehicle dynamics, etc. In order to get this information, we
use the road-facing camera together with the vehicle’s GPS
coordinates, its speed, and the street maps. More specifically,
we obtain two binary features from the road-facing
camera indicating whether a lane exists on the left side and
on the right side of the vehicle. We also augment the vehicle’s
GPS coordinates with the street maps and extract a
binary feature indicating if the vehicle is within 15 meters of
a road artifact such as intersections, turns, highway exits,
etc. We also encode the average, maximum, and minimum
speeds of the vehicle over the last 5 seconds as features.
This results in a six-dimensional feature vector $X \in \mathbb{R}^6$.
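
Assembling the six outside features is straightforward once the upstream detectors have run. The sketch below assumes the lane detections and the distance to the nearest road artifact are already available; the function name, argument names, and feature ordering are hypothetical.

```python
def outside_features(lane_on_left, lane_on_right, dist_to_artifact_m, speeds):
    """Assemble the 6-dimensional outside feature vector X.

    lane_on_left / lane_on_right: booleans from the road-facing camera.
    dist_to_artifact_m: distance in meters to the nearest road artifact
                        (its computation from GPS and maps is not shown).
    speeds: non-empty list of speed samples covering the last 5 seconds.
    """
    near_artifact = dist_to_artifact_m <= 15.0
    return [
        float(lane_on_left),
        float(lane_on_right),
        float(near_artifact),
        sum(speeds) / len(speeds),   # average speed
        max(speeds),                 # maximum speed
        min(speeds),                 # minimum speed
    ]
```

For instance, a vehicle in the right-most lane approaching an intersection would have the left-lane and artifact entries set to 1 and the right-lane entry set to 0.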

6. Experiment 

We first give an overview of our data set, the baseline
algorithms, and our evaluation setup. We then present the
results and discussion. Our video demonstration is available
at: http://www.brain4cars.com

6.1. Experimental Setup 

Data set. Our data set consists of natural driving videos
with both inside and outside views of the car, its speed, and
the global positioning system (GPS) coordinates.² The outside
car video captures the view of the road ahead. We collected
this driving data set under fully natural settings without any
intervention.³ It consists of 1180 miles of freeway and city
driving and encloses 21,000 square miles across two states.
We collected this data set from 10 drivers over a period of
2 The inside and outside cameras operate at 25 and 30 frames/sec, respectively.

3 Protocol: We set up cameras, GPS, and a speed recording device in the
subjects’ personal vehicles and left them to record the data. The subjects were
asked to ignore our setup and drive as they would normally.


two months. The complete data set has a total of 2 million
video frames and includes diverse landscapes. Figure 6
shows a few samples from our data set. We annotated the
driving videos with a total of 700 events containing 274 lane
changes, 131 turns, and 295 randomly sampled instances of
driving straight. Each lane change or turn annotation marks
the start time of the maneuver, i.e., before the car touches
the lane or yaws, respectively. For all annotated events,
we also annotated the lane information, i.e., the number of
lanes on the road and the current lane of the car.

Baseline algorithms we compare with: 

• Chance: Uniformly randomly anticipates a maneuver. 

• SVM: The Support Vector Machine is a discriminative
classifier. Morris et al. [24] take this approach
for anticipating maneuvers.⁴ We train the SVM on 5
seconds of driving context by concatenating all frame
features to get a 3840-dimensional feature vector in $\mathbb{R}^{3840}$.

• Random-Forest: This is also a discriminative classifier
that learns many decision trees from the training
data, and at test time it averages the prediction of the
individual decision trees. We train it on the same features
as SVM with 150 trees of depth ten each.

• HMM: This is the Hidden Markov Model. We train
the HMM on a temporal sequence of feature vectors
that we extract every 0.8 seconds, i.e., every 20 video
frames. We consider three versions of the HMM: (i)
HMM E: with only outside features from the road
camera, the vehicle’s speed, GPS and street maps
(Section 5.2); (ii) HMM F: with only inside features from
the driver’s face (Section 5.1); and (iii) HMM E + F:
with both inside and outside features.

We compare these baseline algorithms with our IOHMM
and AIO-HMM models. The features for our model are extracted
in the same manner as in the HMM E + F method.

Evaluation setup. We evaluate an algorithm based on its
correctness in predicting future maneuvers. We anticipate
maneuvers every 0.8 seconds where the algorithm processes
the recent context and assigns a probability to each of the
four maneuvers: {left lane change, right lane change, left
turn, right turn} and a probability to the event of driving
straight. These five probabilities together sum to one. After
anticipation, i.e. when the algorithm has computed all
five probabilities, the algorithm predicts a maneuver if its
probability is above a threshold. If none of the maneuvers’
probabilities are above this threshold, the algorithm
does not make a maneuver prediction and predicts driving
straight. However, when it predicts one of the four maneuvers,
it sticks with this prediction and makes no further
predictions for the next 5 seconds or until a maneuver occurs,
whichever happens earlier. After 5 seconds have passed or a
maneuver has occurred, it returns to anticipating future maneuvers.
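
The thresholding and hold behaviour described in the evaluation setup can be sketched as a small generator. It is a simplified illustration under stated assumptions: the function and key names are hypothetical, and the early release when a maneuver actually occurs is omitted because it depends on ground truth that is not available to the predictor.

```python
def predict_stream(prob_stream, threshold, step_s=0.8, hold_s=5.0):
    """Turn a stream of per-step probabilities into predictions.

    prob_stream: iterable of dicts mapping each of the four maneuvers and
                 "straight" to a probability (keys are illustrative).
    Yields one label per step; while a maneuver prediction is held,
    the held label is repeated and no new prediction is made.
    """
    held, held_until = None, 0.0
    for k, probs in enumerate(prob_stream):
        t = k * step_s                      # time of this anticipation step
        if held is not None and t < held_until:
            yield held
            continue
        held = None
        maneuvers = {m: p for m, p in probs.items() if m != "straight"}
        best = max(maneuvers, key=maneuvers.get)
        if maneuvers[best] > threshold:
            held, held_until = best, t + hold_s
            yield best
        else:
            yield "straight"
```

With the default 0.8-second step, a prediction made at one step is held for the following six steps before the algorithm resumes anticipating.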


4 Morris et al. [24] considered a binary classification problem (lane
change vs. driving straight) and used RVM [34].




Table 1: Results on our driving data set, showing average precision (Pr), recall (Re), and time-to-maneuver computed from 5-fold cross-validation.
The number inside parentheses is the standard error.

                            Lane change                                   Turns                                         All maneuvers
Algorithm                   Pr (%)      Re (%)      Time-to-maneuver (s)  Pr (%)      Re (%)      Time-to-maneuver (s)  Pr (%)      Re (%)      Time-to-maneuver (s)
Chance                      33.3        33.3        -                     33.3        33.3        -                     20.0        20.0        -
Morris et al. [24] SVM      73.7 (3.4)  57.8 (2.8)  2.40 (0.00)           64.7 (6.5)  47.2 (7.6)  2.40 (0.00)           43.7 (2.4)  37.7 (1.8)  1.20 (0.00)
Random-Forest               71.2 (2.4)  53.4 (3.2)  3.00 (0.00)           68.6 (3.5)  44.4 (3.5)  1.20 (0.00)           51.9 (1.6)  27.7 (1.1)  1.20 (0.00)
HMM E                       75.0 (2.2)  60.4 (5.7)  3.46 (0.08)           74.4 (0.5)  66.6 (3.0)  4.04 (0.05)           63.9 (2.6)  60.2 (4.2)  3.26 (0.01)
HMM F                       76.4 (1.4)  75.2 (1.6)  3.62 (0.08)           75.6 (2.7)  60.1 (1.7)  3.58 (0.20)           64.2 (1.5)  36.8 (1.3)  2.61 (0.11)
HMM E + F                   80.9 (0.9)  79.6 (1.3)  3.61 (0.07)           73.5 (2.2)  75.3 (3.1)  4.53 (0.12)           67.8 (2.0)  67.7 (2.5)  3.72 (0.06)
(Our method) IOHMM          81.6 (1.0)  79.6 (1.9)  3.98 (0.08)           77.6 (3.3)  75.9 (2.5)  4.42 (0.10)           74.2 (1.7)  71.2 (1.6)  3.83 (0.07)
(Our final method) AIO-HMM  83.8 (1.3)  79.2 (2.9)  3.80 (0.07)           80.8 (3.4)  75.2 (2.4)  4.16 (0.11)           77.4 (2.3)  71.2 (1.3)  3.53 (0.06)

Figure 7: Confusion matrix of different algorithms when jointly predicting all the maneuvers. Predictions made by algorithms are
represented by rows, actual maneuvers are represented by columns, and precision on diagonal.


During this process of anticipation and prediction, the al¬ 
gorithm makes (i) true predictions (tp): when it predicts the 
correct maneuver; (ii) false predictions (fp): when it pre¬ 
dicts a maneuver but the driver performs a different maneu¬ 
ver; (iii) false positive predictions ( fpp ): when it predicts 
a maneuver but the driver does not perform any maneuver 
(i.e. driving straight ); and (iv) missed predictions (mp): 
when it predicts driving straight but the driver performs a 
maneuver. We evaluate the algorithms using their precision 
and recall scores: 

p r = _ — _; Re = _—_ 

tp + fp + fpp ’ tp + fp + mp 

V. ✓ V, ✓ 

V' V' 

Total # of maneuver predictions Total # of maneuvers 

The precision measures the fraction of the predicted maneuvers
that are correct and recall measures the fraction of the
maneuvers that are correctly predicted. For true predictions
(tp) we also compute the average time-to-maneuver, where
time-to-maneuver is the interval between the time of the
algorithm’s prediction and the start of the maneuver.
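
As a minimal sketch of the bookkeeping described above, the following Python computes precision, recall, and the average time-to-maneuver from the four outcome counts. The `Counts` container, the function names, and the example numbers are illustrative assumptions and are not taken from the original evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for one algorithm (names follow the text; values are hypothetical)."""
    tp: int   # true predictions: the correct maneuver was predicted
    fp: int   # false predictions: a maneuver was predicted, a different one happened
    fpp: int  # false positive predictions: a maneuver was predicted, driver went straight
    mp: int   # missed predictions: straight was predicted, a maneuver happened


def precision(c: Counts) -> float:
    """tp divided by the total number of maneuver predictions."""
    total_predictions = c.tp + c.fp + c.fpp
    return c.tp / total_predictions if total_predictions else 0.0


def recall(c: Counts) -> float:
    """tp divided by the total number of maneuvers."""
    total_maneuvers = c.tp + c.fp + c.mp
    return c.tp / total_maneuvers if total_maneuvers else 0.0


def mean_time_to_maneuver(true_predictions):
    """Average gap in seconds between prediction time and maneuver start.

    `true_predictions` is a list of (prediction_time, maneuver_start_time)
    pairs, one pair per true prediction.
    """
    gaps = [start - predicted for predicted, start in true_predictions]
    if not gaps:
        raise ValueError("no true predictions to average over")
    return sum(gaps) / len(gaps)


# Hypothetical counts, chosen only to exercise the formulas.
example = Counts(tp=80, fp=10, fpp=10, mp=20)
example_precision = precision(example)  # 80 / (80 + 10 + 10)
example_recall = recall(example)        # 80 / (80 + 10 + 20)
```

Note that fp appears in both denominators: a wrongly labeled maneuver counts both as a maneuver prediction and as a maneuver that occurred.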

We perform cross validation to choose the number of the
driver’s latent states in the AIO-HMM and the threshold on
probabilities for maneuver prediction. For the SVM we
cross-validate the parameter C and the choice of kernel
between Gaussian and polynomial kernels. The parameters
chosen are the ones giving the highest F1-score on a validation
set. The F1-score is the harmonic mean of the precision and
recall, defined as F1 = 2 * Pr * Re / (Pr + Re).
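
The selection rule above can be sketched as follows, assuming precision and recall on the validation set are already available for each candidate threshold. The helper names and the validation numbers are hypothetical and serve only to illustrate picking the setting with the highest F1-score.

```python
def f1_score(pr: float, re: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 0.0 if pr + re == 0 else 2 * pr * re / (pr + re)


def select_threshold(candidates, evaluate):
    """Return the candidate whose validation (precision, recall) gives the highest F1.

    `evaluate` is any callable mapping a threshold to a (precision, recall) pair.
    """
    return max(candidates, key=lambda th: f1_score(*evaluate(th)))


# Hypothetical validation results: threshold -> (precision, recall).
validation = {
    0.5: (0.60, 0.90),  # permissive threshold: many predictions, lower precision
    0.7: (0.75, 0.80),
    0.9: (0.90, 0.40),  # strict threshold: few predictions, lower recall
}
best = select_threshold(validation, validation.get)
```

The same pattern would apply to the other cross-validated choices mentioned above (number of latent states, SVM parameter C, kernel), with `evaluate` retraining the model for each candidate.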

6.2. Results and Discussion 

We evaluate the algorithms on maneuvers that were not
seen during training and report the results using 5-fold cross
validation. Table 1 reports the precision and recall scores
under three settings: (i) Lane change: when the algorithms
only predict the left and right lane changes. This setting
is relevant for highway driving, where the prior probabilities
of turns are low; (ii) Turns: when the algorithms only
predict the left and right turns; and (iii) All maneuvers: here
the algorithms jointly predict all four maneuvers. All three
settings include the instances of driving straight.

As shown in Table 1, the AIO-HMM performs better
than the other algorithms. Its precision is over 80% for the
lane change and turns settings. For jointly predicting all the
maneuvers its precision is 77%, which is 34 percentage points
higher than the previous work by Morris et al. [24] and 26
percentage points higher than the Random-Forest. The AIO-HMM
recall is always comparable to or better than that of the other
algorithms. On average the AIO-HMM predicts maneuvers 3.5
seconds before they occur and up to 4 seconds earlier when
only predicting turns.

Figure 7 shows the confusion matrix plots for jointly
anticipating all the maneuvers. AIO-HMM gives the highest
precision for each maneuver. Modeling maneuver anticipation
with an input-output model allows for a discriminative
modeling of the state transition probabilities using rich
features from outside the vehicle. On the other hand, the
HMM E + F solves a harder problem by learning a generative
model of the outside and inside features together. As
shown in Table 1, the precision of HMM E + F is 10 percentage
points lower than that of AIO-HMM for jointly predicting all
the maneuvers. AIO-HMM extends IOHMM by modeling the temporal
dependencies of events inside the vehicle. This results
in better performance: on average AIO-HMM precision is
3 percentage points higher than that of IOHMM, as shown in
Table 1.

Table 2 compares the fpp of different algorithms.
False positive predictions (fpp) happen when an algorithm
wrongly predicts driving straight as one of the maneuvers.
Therefore a low value of fpp is preferred. HMM F performs
best on this metric at 11%, as it mostly assigns a high
probability to driving straight. However, for the same reason, it
incorrectly predicts driving straight even when maneuvers
happen. This results in the low recall of HMM F at 36%,
as shown in Table 1. AIO-HMM’s fpp is 10 percentage points
lower than that of IOHMM and HMM E + F. In Figure 9 we
compare the F1-scores of different algorithms as the prediction
threshold varies. We observe that IOHMM and AIO-HMM


Table 2: False positive prediction (fpp) of different algorithms.
The number in parentheses is the standard error.

| Algorithm | Lane change | Turns | All |
|---|---|---|---|
| Morris et al. [24] SVM | 15.3 (0.8) | 13.3 (5.6) | 24.0 (3.5) |
| Random-Forest | 16.2 (3.3) | 12.9 (3.7) | 17.5 (4.0) |
| HMM E | 36.2 (6.6) | 33.3 (0.0) | 63.8 (9.4) |
| HMM F | 23.1 (2.1) | 23.3 (3.1) | 11.5 (0.1) |
| HMM E + F | 30.0 (4.8) | 21.2 (3.3) | 40.7 (4.9) |
| IOHMM | 28.4 (1.5) | 25.0 (0.1) | 40.0 (1.5) |
| AIO-HMM | 24.6 (1.5) | 20.0 (2.0) | 30.7 (3.4) |



Figure 8: Effect of time-to-maneuver. Plot compares F1-scores
when algorithms predict maneuvers at a fixed time-to-maneuver,
and shows change in performance as we vary time-to-maneuver.

perform better than the baseline algorithms and their
F1-scores remain stable as we vary the threshold. Therefore,
the prediction threshold is useful as a parameter to trade off
between the precision and recall of algorithms.

Importance of inside and outside driving context. An
important aspect of anticipation is the joint modeling of the
inside and outside driving contexts. HMM F models only
the inside driving context, while HMM E models only the
outside driving context. As shown in Table 1, the precision
and recall values of both models are lower than those of
HMM E + F, which jointly models both contexts. More
specifically, the precision of HMM F on jointly predicting all
the maneuvers is 3, 10, and 13 percentage points lower than
that of HMM E + F, IOHMM, and AIO-HMM, respectively. For
HMM E this difference is 4, 11, and 14 percentage points,
respectively.

Effect of time-to-maneuver. In Figure 8 we compare
F1-scores of the algorithms when they predict maneuvers at
a fixed time-to-maneuver, and show how the performance
changes as we vary the time-to-maneuver. As we get closer
to the start of the maneuvers the F1-scores of the algorithms
increase. As opposed to this setting, in Table 1 the algorithms
predicted maneuvers at the time they were most confident.
Under both the fixed and variable time prediction
settings, the AIO-HMM performs better than the baselines.

Figure 9: Effect of prediction threshold. An algorithm makes a
prediction only when its confidence is above the prediction
threshold. Plot shows how the F1-score varies with prediction threshold.

Anticipation complexity. The AIO-HMM anticipates maneuvers
every 0.8 seconds using the previous 5 seconds of
the driving context. The complexity mainly comprises
feature extraction and the model inference in equation (7).
Fortunately both these steps can be performed as a dynamic
program by storing the computation of the most recent
anticipation. Therefore, for every anticipation we only process
the incoming 0.8 seconds and not the complete 5 seconds
of the driving context. Due to dynamic programming the
inference complexity described in equation (7), O(K |S| |I|^3),
no longer depends on K and reduces to O(|S| |I|^3). On average
we predict a maneuver in under 3.6 milliseconds on a
3.4GHz CPU using MATLAB 2014b on Ubuntu 12.04.
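
The caching idea in the anticipation-complexity paragraph can be illustrated with a standard HMM forward (filtering) recursion. This is a simplified sketch under stated assumptions, not the AIO-HMM inference of equation (7): it uses a plain HMM with a fixed transition matrix, keeps a belief over the whole history rather than a 5-second window, and all names and numbers are hypothetical. The point it shows is that storing the previous forward message makes the cost of each update independent of the number of past steps K.

```python
def forward_step(alpha, transition, likelihood):
    """One filtering update for a plain HMM.

    alpha:      belief over latent states after the previous observation
    transition: transition[i][j] = P(next state j | current state i)
    likelihood: likelihood[j] = P(new observation | state j)
    """
    n = len(alpha)
    unnormalized = [
        likelihood[j] * sum(alpha[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero likelihood under every state")
    return [u / total for u in unnormalized]


class IncrementalFilter:
    """Caches the latest forward message so each update touches only new data."""

    def __init__(self, prior, transition):
        self.alpha = list(prior)
        self.transition = transition

    def update(self, likelihood):
        self.alpha = forward_step(self.alpha, self.transition, likelihood)
        return self.alpha


def batch_filter(prior, transition, likelihoods):
    """Recomputes the belief from scratch; cost grows with the number of steps."""
    alpha = list(prior)
    for likelihood in likelihoods:
        alpha = forward_step(alpha, transition, likelihood)
    return alpha


# Hypothetical two-state example.
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihoods = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.9]]

online = IncrementalFilter(prior, transition)
for likelihood in likelihoods:
    belief = online.update(likelihood)
```

After the loop, `belief` matches `batch_filter(prior, transition, likelihoods)`, but each `update` call did a fixed amount of work regardless of how many observations preceded it.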

6.3. Qualitative discussion 

Common Failure Modes. Wrong anticipations can occur
for different reasons. These include failures in the vision
pipeline and unmodeled events such as interactions with fellow
passengers, overtakes, etc. In 6% of the maneuvers, our
tracker failed due to changes in illumination (we show some
instances in the supplementary material). Wrong anticipations
are also common when drivers strongly rely upon their recent
memory of traffic conditions. In such situations visual cues are
partially available in the form of eye movements. Similarly,
when making turns from turn-only lanes drivers tend not
to reveal many visual cues. With rich sensory integration,
such as radar for modeling the traffic and infra-red cameras for
eye-tracking, along with reasoning about the traffic rules,
we can further improve the performance. Fortunately, the
automobile industry has made significant advances in some
of these areas where our work can apply. Future
work also includes extending our approach to night driving.

Prediction timing. In anticipation there is an inherent
ambiguity. Once the algorithm is certain about a maneuver
above a threshold probability, should it predict immediately
or should it wait for more information? An example of this
ambiguity is in situations where drivers scan the traffic but
do not perform a maneuver. In such situations different
prediction strategies will result in different performances.
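
To make the timing ambiguity concrete, here is a small sketch contrasting two hypothetical prediction strategies on a sequence of per-step maneuver confidences. Neither strategy is claimed to be the one used in the experiments; the function names, the `patience` parameter, and the confidence values are assumptions made for illustration.

```python
def predict_immediately(confidences, threshold):
    """Return the first step whose confidence reaches the threshold, else None."""
    for step, confidence in enumerate(confidences):
        if confidence >= threshold:
            return step
    return None


def predict_after_patience(confidences, threshold, patience):
    """Return the step at which confidence has stayed at or above the
    threshold for `patience` consecutive steps, else None."""
    run = 0
    for step, confidence in enumerate(confidences):
        run = run + 1 if confidence >= threshold else 0
        if run >= patience:
            return step
    return None


# Hypothetical case 1: the driver scans the traffic once, then goes straight.
scan_only = [0.2, 0.8, 0.3, 0.2]
# Hypothetical case 2: confidence builds up before an actual maneuver.
real_maneuver = [0.2, 0.8, 0.3, 0.85, 0.9, 0.95]

early_on_scan = predict_immediately(scan_only, 0.7)            # fires at step 1
patient_on_scan = predict_after_patience(scan_only, 0.7, 2)    # never fires
early_on_real = predict_immediately(real_maneuver, 0.7)        # fires at step 1
patient_on_real = predict_after_patience(real_maneuver, 0.7, 2)  # fires at step 4
```

In this toy setting the immediate strategy predicts earlier but produces a false positive prediction on the scan-only sequence, while the patient strategy avoids it at the cost of a shorter time-to-maneuver.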

7. Conclusion 

In this paper we considered the problem of anticipating
driving maneuvers a few seconds before the driver performs
them. Our work enables advanced driver assistance systems
(ADAS) to alert drivers before they perform a dangerous
maneuver, thereby giving drivers more time to react. We
proposed an AIO-HMM to jointly model the driver’s intention
and the driving context from both inside and outside of
the car. Our approach accurately handles both the temporal
and generative nature of the problem. We extensively evaluated
it on 1180 miles of driving data and showed improvement
over many baselines. Our inference takes only a few
milliseconds; therefore it is suited for real-time use. We will
also publicly release our data set of natural driving.